Add loader for animovement/aniframe Parquet files #963
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #963 +/- ##
===========================================
- Coverage 100.00% 99.41% -0.59%
===========================================
Files 41 42 +1
Lines 2815 3087 +272
===========================================
+ Hits 2815 3069 +254
- Misses 0 18 +18
|
- Add `ValidAniframeParquet` file validator to `validators/files.py`
- Add `from_aniframe_file()` loader in new `movement/io/load_aniframe.py`
- Register "aniframe" as a recognised `SourceSoftware` in `load.py`
- Import `load_aniframe` in `io/__init__.py` to trigger registration
- Add `pyarrow` and `rdata` as core dependencies in `pyproject.toml`
- Add 57-test suite in `tests/test_unit/test_io/test_load_aniframe.py`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sample files needed on GIN
Before this PR can be merged we need at least one (ideally two) real aniframe files on GIN for integration tests. I'll generate these from animovement so the embedded R metadata serialisation is exercised end-to-end. The current unit tests cover all edge cases but mock the metadata-decoding step, which isn't ideal.
Required: 1 × 2D file
The primary integration test file. It should have:
Nice to have: 1 × 3D file
A small 3D file (
What you don't need separate files for
The following are all covered by the synthetic unit tests and do not need dedicated GIN files: |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace triple nested loop with itertools.product and an inner function. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
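A minimal sketch of this kind of refactor, not the PR's actual code: the function names and the three label lists are hypothetical stand-ins, chosen only to show how `itertools.product` flattens a triple nested loop while an inner function holds the per-combination work.

```python
from itertools import product


def build_keys_nested(individuals, keypoints, coords):
    """Original shape: three nested loops (hypothetical example)."""
    keys = []
    for ind in individuals:
        for kp in keypoints:
            for coord in coords:
                keys.append(f"{ind}/{kp}/{coord}")
    return keys


def build_keys_product(individuals, keypoints, coords):
    """Same result with a single loop over itertools.product."""

    def make_key(ind, kp, coord):
        # Inner function: the work done for one combination.
        return f"{ind}/{kp}/{coord}"

    combos = product(individuals, keypoints, coords)
    return [make_key(*combo) for combo in combos]
```

`product` varies its rightmost argument fastest, which is the same order as the nested loops, so both functions return identical lists.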
|
So, outstanding things that have cropped up:
|
- Use metadata variables_what/when/where to classify DataFrame columns instead of hardcoded frozensets; missing fields raise ValueError
- Extra columns not belonging to any aniframe category are inferred to their minimum xarray dimensions and added as Dataset variables
- Constant extra columns are logged at INFO and skipped
- Numeric extra columns stored as float64; others as object dtype
- Add _extract_meta_vars, _infer_extra_dims, _build_extra_array helpers
- Update _resolve_columns to use metadata vars and return extra col list
- Expand test suite from 57 to 71 tests covering all new code paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
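As a rough illustration of metadata-driven column classification (a sketch only; `classify_columns` is a hypothetical name and not one of the helpers listed in the commit), the metadata fields `variables_what`, `variables_when` and `variables_where` decide which slot each column belongs to, and anything left over is treated as an extra column:

```python
REQUIRED_FIELDS = ("variables_what", "variables_when", "variables_where")


def classify_columns(columns, metadata):
    """Split column names into what/when/where/extra using the metadata."""
    missing = [field for field in REQUIRED_FIELDS if field not in metadata]
    if missing:
        # Mirrors the commit's rule: missing fields are an error.
        raise ValueError(f"Metadata is missing required fields: {missing}")

    groups = {}
    known = set()
    for field in REQUIRED_FIELDS:
        declared = set(metadata[field])
        # Keep the DataFrame's own column order within each group.
        groups[field] = [col for col in columns if col in declared]
        known |= declared

    # Anything the metadata does not claim is treated as an extra column.
    groups["extra"] = [col for col in columns if col not in known]
    return groups
```

The example metadata values in any call to this function are assumptions; the real file decides which names appear in each field.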
Per maintainer guidance: tracks are interpreted as individuals by design. Users who need to stitch tracks should do so before loading. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for the PR @roaldarbol! So excited that this is finally happening. I'll do a first high-level pass at this PR next week, after which I'll get in touch with you to get sample data files. Sounds like a plan? |
|
Sounds good to me! |
Adds an `extra_var_dims` floor parameter so callers can ensure extra data variables always carry specific dimensions (e.g. `"individuals"`) even in single-individual/single-keypoint files where auto-inference would otherwise collapse that axis away. Accepts a plain string or a tuple of strings for convenience. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
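One way to picture the "floor" behaviour is sketched below. It is an assumption-laden illustration: `normalise_dims`, `apply_dims_floor` and the `DIM_ORDER` tuple are hypothetical names, and the ordering shown is not taken from movement itself.

```python
# Hypothetical ordering, used only to keep the output deterministic.
DIM_ORDER = ("time", "keypoints", "individuals")


def normalise_dims(dims):
    """Accept None, a plain string, or an iterable of strings."""
    if dims is None:
        return ()
    if isinstance(dims, str):
        return (dims,)
    return tuple(dims)


def apply_dims_floor(inferred, floor, order=DIM_ORDER):
    """Return inferred dims widened so they always include the floor dims."""
    wanted = set(normalise_dims(inferred)) | set(normalise_dims(floor))
    unknown = wanted - set(order)
    if unknown:
        raise ValueError(f"Unknown dimension names: {sorted(unknown)}")
    return tuple(dim for dim in order if dim in wanted)
```

With this sketch, a variable inferred as `("time",)` in a single-individual file and given the floor `"individuals"` would come out as `("time", "individuals")`.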
Extract _normalise_extra_var_dims helper to reduce cognitive complexity of from_aniframe_file back within the C901 limit. Shorten two test docstrings to stay within the 79-character line limit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Picks up the typer dependency from the typer branch to pre-empt the conflict when that branch merges to main. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
For extra columns, I have made the function automatically find the minimal set of dimensions needed to describe each one, so it gets the correct dims automatically. For example, if you have the area of the individual, then in animovement all keypoints will have the identical value, and when imported into movement it would resolve to a (time, individual) variable. The edge case is single-individual/single-keypoint files, where extra variables would resolve to just time. To ensure that extra columns are attributed to the correct dimensions, I just added an extra argument ( extra_var_dims={
"temperature": ("time",),
"bbox_area": ("time", "individuals"),
"speed": ("time", "individuals"),
}) |
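The minimal-dimension search described in this comment can be sketched with plain Python. This is an illustration under stated assumptions, not the PR's `_infer_extra_dims`: rows are modelled as dictionaries, and `infer_min_dims` and `LONG_DIMS` are hypothetical names.

```python
from itertools import combinations

# Long-format index columns, as named in the aniframe description.
LONG_DIMS = ("time", "individual", "keypoint")


def infer_min_dims(rows, column, dims=LONG_DIMS):
    """Return the smallest subset of dims that uniquely determines column.

    An empty tuple means the column is constant across all rows.
    """
    for size in range(len(dims) + 1):
        for subset in combinations(dims, size):
            seen = {}
            consistent = True
            for row in rows:
                key = tuple(row[dim] for dim in subset)
                # First value seen for this key becomes the reference.
                if seen.setdefault(key, row[column]) != row[column]:
                    consistent = False
                    break
            if consistent:
                return subset
    raise ValueError(f"{column!r} varies within a single index combination")
```

Smaller subsets are tried first, so a per-individual quantity repeated across keypoints resolves to `("time", "individual")` rather than all three dimensions. In a single-individual file the same column would collapse to `("time",)`, which is the edge case the comment raises.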
niksirbi
left a comment
I've made a first pass at this @roaldarbol.
Here are my comments, mostly focusing on design decisions and the questions you'd asked. I've also left several inline comments, but have not gone much into implementation details (those can be handled later).
1. `pyarrow`/`rdata` dependencies. Pyarrow is a heavy native package, so forcing it on all users for one file format is out of proportion. The good news is that both packages are also available on conda-forge (that's a hard requirement for us). My recommendation would be to gate both `pyarrow` and `rdata` behind a new `aniframe` optional extra. We can also do a runtime check on top of that, as you suggested, with a clear error message ("install with `pip install movement[aniframe]` or `conda install -c conda-forge pyarrow rdata`") if someone tries to use the new loader without the requisite dependencies. On `movement`'s conda-forge feedstock (I update that upon release), I will add both packages under `run_constrained:`, not `run:`. That keeps the default conda install lean, and still protects against version incompatibilities if a user installs them separately. This matches the pattern used by other scientific Python projects (including `xarray`) for their heavy optional backends. We will also need to update the installation guide accordingly.
2. `point_of_reference = "bottom_left"` handling. I'm fine with just emitting a warning for now, but you may also want to consider doing the y-flip by default. The reason is that with "bottom_left" the data will not correctly overlay on a video/frame as is. So if we intend to use our `napari` GUI as a viewer of aniframe .parquet files (would be neat!), auto-converting to `movement`'s "top_left" convention would make things easier. In the near future, it would enable someone to drag and drop a video, followed by the parquet file, into `napari`.
3. Metadata decoding strategy. The `rdata` package is lightweight and the full-metadata path is much better UX than "pass fps / source manually". Keep as-is, but make it part of an optional extra dependency (see point 1).
4. `source_software` fallback when `source` is NA. Let's not require a user to pass `source_software`. I'm fine with falling back to either `aniframe` or None (as the code currently does).
5. Time-unit handling. Converting everything to seconds is the right default, with a fallback to frame units when that's not possible (I think your implementation already takes that approach).
6. Extra-column namespace. Unprefixed is fine and matches user expectations; this is the right default.
7. Sample data on GIN. The `2D + 3D` spec in the comment thread looks right. Feel free to share the files with me and I will put them on GIN. For each .parquet file, also fill out a `metadata.yaml` entry as [described here](https://movement.neuroinformatics.dev/latest/community/contributing.html#metadata-yaml-example-entry). For the 2D file, also provide a sample frame at minimum, and optionally the corresponding video if you can share it. This will make it possible to user-test the `napari` overlay.
8. Loading aniframe files in napari. If we want the .parquet files to also be loadable into `napari`, you will have to at minimum update the `SUPPORTED_POSES_FILES` constant in `movement/napari/loader_widgets.py`. See also point 2 above.
|
Thanks for such a detailed review @niksirbi! Will go through it later this week! Before I do anything else, I'm trying to look for lightweight alternatives to Edit: From the
For the auto-flipping y-axis, I need to implement something - a metadata field - that keeps the y value, ideally the frame height, but otherwise probably the max. y value from the data (I think that's what I currently use in the readers), which would allow converting between |
Thanks for looking into alternatives! I spent some time looking into them as well. I think we can exclude My reasoning for having it as an optional dependency is that it's heavy-ish and necessary for only a small subset of users (this may change in the future though...). The counter-argument would be to avoid complicating installation instructions. Before we reach a decision on this, I'd be curious to hear why you think it's preferable to keep the deps obligatory.
Ah yeah, that makes sense. If you prefer, you're welcome to proceed here without waiting for that, and tackle the auto-flipping in a future PR (opening an issue to avoid forgetting it). Up to you! |
I think I'll get it done for this PR. My thought is basically, at read time, to encode either (1) the frame height or (2) the max(y) as
For I think my main concern is on the conda-forge side of things, that users need to explicitly add them as extra dependencies. But with no leaner good alternative I think it's just the way it will have to be. :-) In that case, I imagine we should add an informative error, something like:

try:
    import pyarrow.parquet as pq
except ImportError as e:
    # Fail early with install hints when the optional extra is absent.
    raise ImportError(
        "Reading Parquet files requires the optional 'aniframe' "
        "dependencies (pyarrow, rdata), which are not installed.\n\n"
        "Install them with one of:\n"
        "  pip install 'movement[aniframe]'\n"
        "  conda install -c conda-forge pyarrow rdata\n"
        "  pixi add pyarrow rdata\n\n"
        "See https://movement.neuroinformatics.dev/latest/user_guide/installation.html "
        "for details."
    ) from e

EDIT: I actually quite like the tabbed install instructions, so I don't think it'll be an issue, maybe we'd just need to add it there too? https://movement.neuroinformatics.dev/latest/user_guide/installation.html#install-the-package |
Sounds sensible. For this PR, I would just implement this at .parquet read time
In the call today, we decided to have I think raising an informative error, like the one you suggested, is a great idea. |
Implements the inline and high-level review comments on neuroinformatics-unit#963.

Refactor:
- Move public `from_aniframe_file` into `load_poses.py` next to other format-specific loaders; keep private helpers in a new `aniframe.py` module (mirroring the `nwb.py` pattern).
- Split `ValidAniframeParquet` into `_file_validator + _parquet_validator` and fold the `variables_what/when/where` required-fields check into the validator itself.
- Rename `point_of_reference` → `origin` throughout (metadata key, dataset attr, docstrings, tests).

Behaviour:
- Add `use_frame_numbers_from_file=False` flag so callers can preserve the file's `time` column values (irregular sampling, non-zero start) and convert them to seconds via the now-wired `_TIME_UNIT_TO_SECONDS` lookup.
- Drop the `extra_var_dims` parameter; instead auto-expand inferred dims to include any canonical dim that is a singleton in this file, so output shapes stay consistent across files differing only in singleton-dim sizes (e.g. 1- vs 3-individual).
- Auto-flip y when `origin == "bottom_left"`. Uses `y_height` from the metadata when present, otherwise falls back to `max(y)` from the data. Sets `ds.attrs["origin"] = "top_left"` and records `y_height` so the flip can be reversed later. The previous bottom-left `UserWarning` is gone, since the data is now oriented correctly.

Build:
- Move `pyarrow` and `rdata` into a new `aniframe` optional extra. Bare `import movement` no longer requires either; the loader (and validator) raise a clear `ImportError` with install instructions only when an aniframe code path is actually invoked.
- Update the installation guide with the new extra across all three install tabs (conda-forge, pip, uv) including combined-extras syntax.
- Register `aniframe / *.parquet` in the napari plugin's `SUPPORTED_POSES_FILES` so the GUI can browse and load aniframe files.

Cleanups:
- `_decode_aniframe_metadata` now raises `ValueError` with a clear message on missing `b"r"` key or an undecodable R blob (was log-and-return-`{}`, which produced a confusing two-step error).
- Special-case `bool` dtype in `_build_extra_array` (previously coerced to `float64` with NaN fill via `is_numeric_dtype`).
- `individual_names` now uses first-occurrence ordering to match `keypoint_names`.
- Drop the redundant `or unit_time == "s"` clause in `_resolve_fps`.
- Docstring fixes: link to the aniframe spec, example uses `from_aniframe_file` directly, `_resolve_columns` notes "INFO log" instead of "warning".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
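To make the time-unit conversion concrete, here is a hedged sketch. The commit names a `_TIME_UNIT_TO_SECONDS` lookup but does not list its contents, so the table below is an assumption built from the unit names in the PR description (`"ns"`, `"us"`, `"ms"`, `"s"`, `"m"`, `"h"`); `time_to_seconds` and its return shape are hypothetical.

```python
# Assumed contents: the commit names the lookup but not its values.
TIME_UNIT_TO_SECONDS = {
    "ns": 1e-9,
    "us": 1e-6,
    "ms": 1e-3,
    "s": 1.0,
    "m": 60.0,
    "h": 3600.0,
}


def time_to_seconds(values, unit_time, sampling_rate=None):
    """Return (converted_values, unit) for a list of time values."""
    if unit_time == "frame":
        if sampling_rate:
            # Frame numbers become seconds once a sampling rate is known.
            return [v / sampling_rate for v in values], "s"
        # No sampling rate: fall back to the raw frame numbers.
        return list(values), "frames"
    if unit_time not in TIME_UNIT_TO_SECONDS:
        raise ValueError(f"Unsupported unit_time: {unit_time!r}")
    factor = TIME_UNIT_TO_SECONDS[unit_time]
    return [v * factor for v in values], "s"
```

Because the original values are scaled rather than regenerated from an index, irregular sampling and a non-zero start are preserved, which is the stated purpose of the flag.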
|
Alright, now I've spent a few hours going through the comments and addressing them. Here are a few things I think are worth flagging:
|
SonarQube flagged the direct == checks on ds.attrs["y_height"] in the new bottom_left → top_left flip tests. Switch to pytest.approx, matching the existing fps assertions in the same file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
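For context, the flip these tests exercise can be sketched in a few lines. This is an assumption-based illustration, not the loader's code: the formula `y_height - y` is assumed (the thread does not spell it out), `flip_y_to_top_left` is a hypothetical name, and plain lists stand in for the array data.

```python
import math


def flip_y_to_top_left(y_values, y_height=None):
    """Flip bottom-left-origin y values to a top-left origin.

    Returns (flipped_values, y_height) so the flip can be reversed later.
    """
    if y_height is None:
        finite = [y for y in y_values if not math.isnan(y)]
        if not finite:
            raise ValueError("Cannot infer y_height from all-NaN data")
        # Fallback described in the thread: use max(y) from the data.
        y_height = max(finite)
    # NaN inputs stay NaN because y_height - nan is nan.
    flipped = [y_height - y for y in y_values]
    return flipped, y_height
```

Since `y_height` ends up as a float derived from the data or metadata, comparing it with a tolerance (as the commit does with `pytest.approx`) is safer than a direct `==` check. Applying the same function again with the recorded `y_height` restores the original values.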
|
|
Additional point that we should discuss:
|
|
Thanks for the updates @roaldarbol. Have been busy with other projects this week but I will have another pass on this next week at the latest. I'll also think about the sessions/trials question. |
|
Btw, it's looking like #973 is going to be merged ahead of this one, so you will be able to do away with the singular-plural conversion in this PR. |
|
Follow-up for the
|



Summary
Implements a reader for the aniframe format (Parquet files produced by the animovement R ecosystem), enabling data exchange between the `movement` Python package and the `aniframe`/`aniread` R packages.

Closes #307.
Background
The aniframe format is a long-format tidy data frame (one row per `individual x keypoint x time`) serialised as a Parquet file with rich metadata stored in the file-level schema metadata. The ecosystem includes:

The aniread `read_movement()` function already reads movement netCDF into aniframe; this PR implements the reverse direction.

Format overview (confirmed from a real aniframe Parquet file)
Columns (three conceptual slots)
- `individual`, `keypoint` | `model`, `track`
- `time` | `session`, `trial`
- `x`, `y` | `z`, `rho`, `phi`, `theta` | `confidence`

Column types observed: `individual` as `int32`, `keypoint` as Arrow dictionary (categorical), `session`/`trial` as `int32`, `time` as `int32`, `x`/`y`/`confidence` as `float64`.

Parquet metadata
The aniframe metadata is stored in the Parquet file-level schema metadata under the key `b'r'`, serialised using R's native ASCII serialisation format (version 3, starts `A\n3\n...`). It is not JSON. Fields include:

- `source`
- `sampling_rate`
- `unit_time`: `"frame"`
- `unit_space`: `"px"`
- `unit_angle`: `"rad"`
- `reference_frame`: `"allocentric"`
- `coordinate_system`: `"cartesian_2d"`
- `point_of_reference`: `"bottom_left"`
- `variables_what` / `variables_when` / `variables_where`

Design decisions
Column mapping: aniframe -> movement
| aniframe | movement |
|---|---|
| `individual` | `individuals` |
| `keypoint` | `keypoints` |
| `track` | `individuals` |
| `time` | `time` coord |
| `x`, `y` | `space = ["x", "y"]` |
| `x`, `y`, `z` | `space = ["x", "y", "z"]` |
| `confidence` | `confidence` variable |

Extra `variables_what`/`variables_when` columns (e.g., `model`, `session`, `trial`): the general rule is whether the extra columns are resolvable -- i.e., they contain only a single unique value and can be safely dropped:

A file with `session=1` and `trial=1` (constants) would load cleanly under this rule.

Polar/spherical coordinates (`rho`, `phi`, `theta`): error out -- movement only supports Cartesian `space` coordinates.

Metadata mapping

| aniframe | movement `.attrs` | Notes |
|---|---|---|
| `source` | `source_software` | (e.g. `"SLEAP"`, `"DeepLabCut"`); fall back to `"aniframe"` if `source` is NA |
| `sampling_rate` | `fps` | |
| `unit_time` | `time_unit` | `fps` from `sampling_rate` |
| `unit_space` | `space_unit` (custom) | |
| `reference_frame` | `reference_frame` (custom) | |
| `point_of_reference` | `point_of_reference` (custom) | `"bottom_left"` (movement/napari convention is top-left origin) |

Time unit handling: automatically convert any `unit_time` value to seconds and derive `fps` from `sampling_rate` where available:

- `"frame"` -> pass raw frame numbers to `from_numpy()` without `fps` (time axis = original integers, e.g. starting at 1)
- `"s"` -> already in seconds; set `fps = sampling_rate`
- `"ms"`, `"us"`, `"ns"` -> divide to seconds; set `fps = sampling_rate`
- `"m"`, `"h"` -> convert to seconds; set `fps = sampling_rate`

Scope
- `keypoint = "centroid"`) is deferred.
- `read_movement()`.

Dependencies
`pyarrow`
- `xarray[accel,io,viz]` does not pull it in transitively (confirmed by inspection of the installed environment).
- `check_installed` check following the pattern in aniread's `check_arrow()`.

R metadata decoding (`rdata`)
- `rdata` is a pure-Python R serialisation parser. Its only dependencies are `numpy`, `xarray`, and `pandas` -- all already required by movement. This makes it a low-cost addition with no new transitive dependencies.
- `variables_what/when/where` from column names (same heuristics aniframe itself uses), and require the user to pass `fps`/`source_software` explicitly.

Open questions / items for discussion
1. `pyarrow` as core vs optional dependency: Not currently in the dep tree. Add to `[project.dependencies]`, or guard with a runtime check and ask users to install it as needed?
2. `point_of_reference` coordinate flip: aniframe defaults to `"bottom_left"` origin; movement/napari conventionally uses top-left (image coordinates). For now the loader will warn and leave data as-is. Automatic y-axis flipping could be added later.
3. Metadata decoding strategy: Use `rdata` (pure Python, no new transitive deps) for full metadata, or infer from column names and accept that `source_software`/`fps`/`unit_time` may need to be provided by the caller?
4. `source_software` fallback when aniframe `source` is NA: use `"aniframe"`, or require the caller to pass `source_software` explicitly?

Files to create / modify
New files
- `movement/io/load_aniframe.py` -- loader function `from_aniframe_file()` and internal parsing helpers
- `tests/test_unit/test_io/test_load_aniframe.py` -- unit tests

Modified files
- `movement/validators/files.py` -- add `ValidAniframeParquet` attrs class
- `movement/io/load.py` -- update `SourceSoftware` type alias; import new loader
- `movement/io/__init__.py` -- import `load_aniframe` to trigger loader registration
- `docs/source/user_guide/input_output.md` -- add row to supported formats table
- `pyproject.toml` -- add `pyarrow`; consider `rdata`
- `movement/sample_data.py` -- register sample aniframe Parquet file (once added to GIN)

Test plan
- `position`, `confidence`, `individuals`, `keypoints`, `time`, `fps`, `source_software`
- `track` column renamed to `individual` with a `UserWarning`
- `model` column with multiple values -> `ValueError`
- `model` column with single value -> dropped with info log
- `session`/`trial` with multiple values -> `ValueError`
- `session`/`trial` with single value -> dropped with info log
- `variables_where` -> `ValueError`
- `confidence` column -> filled with `NaN`
- `x`, `y`, `z`) -> `space = ["x", "y", "z"]`
- `unit_time = "ms"` + `sampling_rate = 30` -> correct seconds time axis and `fps = 30`
- `unit_time = "frame"` -> time axis preserved as original integers (e.g. starting at 1)
- `point_of_reference = "bottom_left"` -> `UserWarning` emitted
- `source_software` forwarded from aniframe `source` metadata
- `source_software="auto"`) correctly identifies `.parquet` files

Generated with Claude Code